28 research outputs found

    Learning languages from parallel corpora

    This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessments can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Second, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. Third, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
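    As a minimal sketch of the two mechanisms described above, the hypothetical Python snippet below generates a cloze exercise from a word-aligned sentence pair and propagates a quality rating to a new language pair via a pivot language (triangulation). All data structures, names and the scoring rule are illustrative assumptions, not the application blueprint itself.

```python
# Hypothetical sketch: cloze generation from a word-aligned sentence pair and
# score transfer via a pivot language. Not the authors' implementation.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AlignedPair:
    src_tokens: list[str]              # e.g. a Swedish sentence
    tgt_tokens: list[str]              # e.g. its German translation
    alignment: list[tuple[int, int]]   # (source index, target index) links


def make_cloze(pair: AlignedPair, src_index: int) -> tuple[str, str] | None:
    """Blank out the target word aligned to src_tokens[src_index]."""
    tgt_indices = [t for s, t in pair.alignment if s == src_index]
    if not tgt_indices:
        return None                    # unaligned word: no exercise possible
    gap = tgt_indices[0]
    prompt = " ".join("____" if i == gap else tok
                      for i, tok in enumerate(pair.tgt_tokens))
    return prompt, pair.tgt_tokens[gap]


def triangulate(score_a_pivot: float, score_pivot_b: float) -> float:
    """Transfer ratings A->pivot and pivot->B to the unseen pair A->B.
    Here the weaker of the two ratings is propagated (an assumption)."""
    return min(score_a_pivot, score_pivot_b)


pair = AlignedPair(
    src_tokens=["jag", "läser", "boken"],
    tgt_tokens=["ich", "lese", "das", "Buch"],
    alignment=[(0, 0), (1, 1), (2, 3)],
)
print(make_cloze(pair, 1))     # ('ich ____ das Buch', 'lese')
print(triangulate(0.9, 0.7))   # 0.7
```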

    Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora

    The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries, in the format of generalised treebank query languages, are automatically translated into SQL queries.
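    A minimal sketch of the storage and querying idea, assuming a strongly simplified relational schema: tokens carry their annotation as columns, word alignment is a link table, and a treebank-style query is expressed as the SQL join it could be translated into. Table and column names are invented; the actual database design is more elaborate.

```python
# Strongly simplified schema for annotated, word-aligned parallel corpora,
# using SQLite for the sake of a runnable example. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE token (
    token_id INTEGER PRIMARY KEY,
    lang     TEXT,      -- e.g. 'de', 'sv'
    sent_id  INTEGER,   -- sentence the token belongs to
    position INTEGER,   -- position within the sentence
    surface  TEXT,
    lemma    TEXT,
    pos      TEXT       -- part-of-speech tag
);
CREATE TABLE word_alignment (
    src_token_id INTEGER REFERENCES token(token_id),
    tgt_token_id INTEGER REFERENCES token(token_id)
);
""")

# A treebank-style query such as  [pos="VERB" lang="de"] aligned-to [lang="sv"]
# could be translated into a SQL join like this one:
QUERY = """
SELECT s.surface AS de_verb, t.surface AS sv_translation
FROM token s
JOIN word_alignment a ON a.src_token_id = s.token_id
JOIN token t          ON t.token_id     = a.tgt_token_id
WHERE s.lang = 'de' AND s.pos = 'VERB' AND t.lang = 'sv';
"""

con.execute("INSERT INTO token VALUES (1, 'de', 1, 2, 'liest', 'lesen', 'VERB')")
con.execute("INSERT INTO token VALUES (2, 'sv', 1, 1, 'läser', 'läsa', 'VERB')")
con.execute("INSERT INTO word_alignment VALUES (1, 2)")
print(con.execute(QUERY).fetchall())   # [('liest', 'läser')]
```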

    Exploring Properties of Intralingual and Interlingual Association Measures Visually

    We present an interactive interface for exploring the properties of intralingual and interlingual association measures. Used in conjunction, they can be employed for phraseme identification in word-aligned parallel corpora. The customizable component we built to visualize individual results can show part-of-speech tags, syntactic dependency relations and word alignments next to the tokens of two corresponding sentences.
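    One widely used association measure that such an interface could display is pointwise mutual information; the sketch below computes it for an intralingual pair (adjacent words in one corpus) and for an interlingual pair (word-aligned tokens across a parallel corpus). Which measures the interface actually implements is not specified here, so treat the choice of PMI and all counts as assumptions.

```python
# Pointwise mutual information as one possible association measure,
# applied intralingually and interlingually. Counts are invented.
import math
from collections import Counter


def pmi(pair_count: int, count_x: int, count_y: int, total: int) -> float:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    return math.log2((pair_count / total) /
                     ((count_x / total) * (count_y / total)))


# Intralingual: association of adjacent words within one corpus.
tokens = "strong coffee and strong tea and weak coffee".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(pmi(bigrams[("strong", "coffee")],
          unigrams["strong"], unigrams["coffee"], len(tokens)))

# Interlingual: the same formula applied to counts of word-aligned pairs
# in a parallel corpus (numbers below are placeholders).
print(pmi(pair_count=40, count_x=50, count_y=60, total=10_000))
```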

    Binomials in Swedish corpora – ‘Ordpar 1965’ revisited

    This paper describes a corpus study on Swedish binomials, a special type of multi-word expression. Binomials follow the pattern "X conjunction Y", where X and Y are words, typically of the same part of speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to see to what extent these binomials can still be found in modern corpora and therefore checked the list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today, even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.
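    A hedged sketch of the kind of check described: look up one binomial candidate from the Bendz list in two corpora and compare its relative frequency per million tokens. The token lists below merely stand in for real corpus access and are invented.

```python
# Toy check of one binomial candidate against two (mocked) corpora,
# comparing relative frequency per million tokens.
from collections import Counter


def freq_per_million(trigram: tuple, tokens: list) -> float:
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return counts[trigram] / len(tokens) * 1_000_000


# Placeholder token lists; in practice these would be the full corpora.
europarl = "herr talman mina damer och herrar jag vill tacka".split()
opensubtitles = "hej då vi ses i morgon".split()
candidate = ("damer", "och", "herrar")   # 'ladies and gentlemen'

for name, corpus in [("Europarl", europarl), ("OpenSubtitles", opensubtitles)]:
    print(name, round(freq_per_million(candidate, corpus)))
```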

    SwissBERT: The Multilingual Language Model for Switzerland

    We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert.
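    A short usage sketch, following the public model card for ZurichNLP/swissbert: the model is loaded with Hugging Face transformers and one of its language adapters is activated before encoding a sentence. The adapter codes and the set_default_language call are taken from that card; the exact API may differ across transformers versions, so treat this as illustrative rather than authoritative.

```python
# Illustrative use of SwissBERT via Hugging Face transformers; the language
# adapter codes and set_default_language() follow the public model card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ZurichNLP/swissbert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.set_default_language("de_CH")    # also available: fr_CH, it_CH, rm_CH

inputs = tokenizer("Die Schweiz hat vier Landessprachen.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into one sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)        # e.g. torch.Size([1, 768])
```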

    NLP Corpus Observatory – Looking for Constellations in Parallel Corpora to Improve Learners’ Collocational Skills

    The use of corpora in language learning, both in classroom and self-study situations, has proven useful. Investigations into technology use show a benefit for learners who are able to work with corpus data through easily accessible technology. However, relatively little work has been done on exploring the possibilities of parallel corpora for language learning applications. The work described in this paper explores how a parallel corpus, enhanced with several annotation layers generated by NLP techniques, can be used to extract collocations that are non-compositional and thus indispensable to learn. We identify constellations, i.e. combinations of intra- and interlingual relations, calculate association scores on each relation and, based thereon, a joint score for each constellation. In this way, we are able to find relevant collocations for different types of constellations. We evaluate our approach and discuss scenarios in which language learners can playfully explore collocations. Our explorative web tool is freely accessible, generates collocation dictionaries on the fly, and links them to example sentences to ensure context embedding.
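    To make the scoring idea concrete, the sketch below represents a constellation as a set of intralingual and interlingual association scores and merges them into a joint score. The geometric mean used here is an assumption for illustration; the paper's actual combination formula is not reproduced.

```python
# Illustrative constellation scoring: intra- and interlingual association
# scores are merged into a joint score (geometric mean is an assumption).
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass
class Constellation:
    source_pair: tuple            # e.g. a verb-object collocation
    target_pair: tuple            # its aligned counterpart
    intralingual_scores: list     # association score per monolingual relation
    interlingual_scores: list     # association score per alignment link

    def joint_score(self) -> float:
        return geometric_mean(self.intralingual_scores
                              + self.interlingual_scores)


c = Constellation(
    source_pair=("fatta", "beslut"),          # Swedish: 'take a decision'
    target_pair=("treffen", "Entscheidung"),  # German counterpart
    intralingual_scores=[7.2, 6.8],
    interlingual_scores=[5.1, 4.4],
)
print(round(c.joint_score(), 2))
```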

    Crossing the Border Twice: Reimporting Prepositions to Alleviate L1-Specific Transfer Errors

    We present a data-driven approach that exploits word alignment in a large parallel corpus with the objective of identifying those verb- and adjective-preposition combinations which are difficult for L2 language learners. This allows us, on the one hand, to provide language-specific ranked lists that help learners focus on particularly challenging combinations given their native language (L1). On the other hand, we provide extensive statistics on such combinations with the objective of facilitating automatic error correction for preposition use in learner texts. We evaluate these lists first manually and then automatically, by applying our statistics to an error-correction task.
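    The ranking intuition can be illustrated with a small sketch: for each source-language verb-preposition combination, count which target-language prepositions it aligns to and treat high variability (a low probability of the dominant translation) as a proxy for learner difficulty. The counts and the difficulty score below are invented for illustration and are not the paper's exact statistic.

```python
# Toy ranking of verb-preposition combinations by how variable their aligned
# target prepositions are; high variability serves as a difficulty proxy.
from collections import Counter, defaultdict

# (German verb + preposition, aligned English preposition) observations,
# as they could be extracted from a word-aligned corpus; data is invented.
observations = [
    ("warten auf", "for"), ("warten auf", "for"), ("warten auf", "on"),
    ("denken an", "of"), ("denken an", "about"), ("denken an", "of"),
    ("abhängen von", "on"), ("abhängen von", "on"), ("abhängen von", "on"),
]

by_combination = defaultdict(Counter)
for combination, en_prep in observations:
    by_combination[combination][en_prep] += 1


def difficulty(counts: Counter) -> float:
    """1 minus the probability of the dominant translation (0 = trivial)."""
    return 1 - counts.most_common(1)[0][1] / sum(counts.values())


for combination, counts in sorted(by_combination.items(),
                                  key=lambda item: difficulty(item[1]),
                                  reverse=True):
    print(f"{combination:15s} {difficulty(counts):.2f} {dict(counts)}")
```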

    Multi-word Adverbs – How well are they handled in Parsing and Machine Translation?

    Multi-word expressions are often considered problematic for parsing and other natural language processing tasks. In this paper we investigate a specific type of multi-word expression: binomial adverbs, which follow the pattern adverb + conjunction + adverb. We identify and evaluate binomial adverbs in English, German and Swedish, and compute their degree of idiomaticity with an ordering test and with a mutual information score. We show that these idiomaticity measures point us to a number of fixed multi-word expressions which are often mis-tagged and mis-parsed. Interestingly, a second evaluation shows that state-of-the-art machine translation handles them well – with some exceptions.
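    A sketch of the two idiomaticity indicators mentioned above for a binomial adverb "X and Y": an ordering test (how strongly the pair prefers one order over its reversal) and a mutual-information-style association score. The exact formulas and corpus counts used in the paper may differ; the numbers below are invented.

```python
# Ordering test and a mutual-information-style score for a binomial adverb;
# corpus counts below are invented placeholders.
import math


def ordering_ratio(count_xy: int, count_yx: int) -> float:
    """Share of occurrences in the preferred order; 1.0 = completely fixed."""
    return max(count_xy, count_yx) / (count_xy + count_yx)


def pmi(pair_count: int, count_x: int, count_y: int, total: int) -> float:
    return math.log2((pair_count / total) /
                     ((count_x / total) * (count_y / total)))


# Example: German "ab und zu" ('now and then').
print(ordering_ratio(count_xy=950, count_yx=3))   # ~0.997, i.e. fixed order
print(pmi(pair_count=953, count_x=12_000, count_y=30_000, total=5_000_000))
```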

    Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database

    We present an approach for searching and exploring translation variants of multi-word units in large multiparallel corpora based on a relational database management system. Our web-based application Multilingwis, which allows for multilingual lookups of phrases and words in English, French, German, Italian and Spanish, is of interest to anybody who wants to quickly compare expressions across several languages, such as language learners without linguistic knowledge. In this paper, we focus on the technical aspects of how to represent and efficiently retrieve all occurrences that match the user's query in one of the five languages, simultaneously with their translations into the other four. To identify such translations in our corpus of 220 million tokens in total, we use statistical sentence and word alignment. By using materialized views, composite indexes, and pre-planned search functions, our relational database management system handles large result sets with only moderate demands on the underlying hardware. As our systematic evaluation on 200 search terms per language shows, we achieve retrieval times below 1 second in 75% of cases for multi-word expressions.
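    The database-side techniques named above can be sketched as PostgreSQL-style DDL: a materialized view that precomputes the token-alignment join, a composite index matching the typical lookup pattern, and a pre-planned search query executed with bound parameters (e.g. via psycopg2). All table, view and column names are hypothetical and do not reproduce the actual Multilingwis schema.

```python
# PostgreSQL-style DDL and a pre-planned search query, kept as strings that a
# driver such as psycopg2 would execute; all names are hypothetical.
DDL = """
-- Precompute the expensive token/alignment join once (materialized view)
-- instead of repeating it for every user query.
CREATE MATERIALIZED VIEW aligned_token AS
SELECT s.lemma       AS src_lemma,
       s.lang        AS src_lang,
       t.lemma       AS tgt_lemma,
       t.lang        AS tgt_lang,
       s.sentence_id AS src_sentence_id
FROM token s
JOIN word_alignment a ON a.src_token_id = s.token_id
JOIN token t          ON t.token_id     = a.tgt_token_id;

-- Composite index matching the typical lookup pattern:
-- language of the search term first, then its lemma.
CREATE INDEX aligned_token_src_idx ON aligned_token (src_lang, src_lemma);
"""

# The search itself only needs the user's term and language as parameters.
SEARCH = """
SELECT tgt_lang, tgt_lemma, count(*) AS freq
FROM aligned_token
WHERE src_lang = %(lang)s AND src_lemma = %(lemma)s
GROUP BY tgt_lang, tgt_lemma
ORDER BY freq DESC;
"""

print(DDL)
print(SEARCH)   # run e.g. via psycopg2 with {"lang": "de", "lemma": "Haus"}
```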